Abstract It is shown that bootstrap approximations of support vector machines
(SVMs) based on a general convex and smooth loss function and on a general
kernel are consistent. This result is useful to approximate the unknown finite
sample distribution of SVMs by the bootstrap approach.



1 Introduction 

Support vector machines and related kernel based methods can be considered as
a hot topic in machine learning because they have good statistical and numerical
properties under weak assumptions and have demonstrated good generalization
properties in many applications, see e.g. [14, 15], [10], and [12]. To the best
of our knowledge, the original SVM approach by [1] was derived from the
generalized portrait algorithm invented earlier by [16]. Throughout the paper, the
term SVM will be used in the broad sense, i.e. for a general convex loss function
and a general kernel.

SVMs based on many standard kernels, for example the Gaussian RBF kernel,
are nonparametric methods. The finite sample distribution of many nonparametric
methods is unfortunately unknown because the distribution P from which the
data were generated is usually completely unknown and because often only
asymptotic results describing the consistency or the rate of convergence of
such methods are known so far. Furthermore, there is in general no uniform rate
of convergence for such nonparametric methods due to the famous no-free-lunch
theorem, see [5] and [6]. Informally speaking, the no-free-lunch theorem states
that, for sufficiently malign distributions, the average risk of any statistical
(classification) method may tend arbitrarily slowly to zero. These facts are true
for SVMs. SVMs are known to be universally consistent and fast rates of
convergence are known for broad subsets of all probability distributions. The
asymptotic normality of SVMs was shown recently by [8] under certain conditions.

Here, we apply a different approach to SVMs, namely Efron's bootstrap. The 
goal of this paper is to show that bootstrap approximations of SVMs which are 
based on a general convex and smooth loss function and a general smooth kernel are 
consistent under mild assumptions; more precisely, convergence in outer probability 
is shown. This result is useful to draw statistical decisions based on SVMs, e.g. 
confidence intervals, tolerance intervals and so on. 

We mention that both the sequence of SVMs and the sequence of their
corresponding risks are qualitatively robust under mild assumptions, see [2]. Hence,
Efron's bootstrap approach turns out to be quite successful for SVMs in several
respects.

The rest of the paper has the following structure. Section 2 gives a brief introduc- 
tion into SVMs. Section 3 gives the result. The last section contains the proof and 
related results. 



2 Support Vector Machines 

Current statistical applications are characterized by a wealth of large and
high-dimensional data sets. In classification and in regression problems there is a
variable of main interest, often called "output values" or "response", and a number
of potential explanatory variables, which are often called "input values". These
input values are used to model the observed output values or to predict future
output values. The observations consist of n pairs $(x_1,y_1), \ldots, (x_n,y_n)$,
which will be assumed to be independent realizations of a random pair $(X,Y)$. We
are interested in minimizing the risk, i.e. in obtaining a function
$f : \mathcal{X} \to \mathcal{Y}$ such that $f(x)$ is a good predictor for the
response y, if $X = x$ is observed. The prediction should be made in an automatic
way. We refer to this process of determining a prediction method as "statistical
machine learning", see e.g. [14, 15, 10, 3, 11]. Here, by "good predictor" we mean
that f minimizes the expected loss, i.e. the risk,

    $\mathcal{R}_{L,P}(f) = E_P[L(X,Y,f(X))]$,

where P denotes the unknown joint distribution of the random pair $(X,Y)$ and
$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0,+\infty)$ is a fixed
loss function. As a simple example, the least squares loss
$L(X,Y,f(X)) = (Y - f(X))^2$ yields the optimal predictor
$f(x) = E_P(Y \mid X = x)$, $x \in \mathcal{X}$. Because P is unknown, we can
neither compute nor minimize the risk $\mathcal{R}_{L,P}(f)$ directly.
Support vector machines, see [16], [1], [14, 15], provide a highly versatile
framework to perform statistical machine learning in a wide variety of setups. The
minimization of regularized empirical risks over reproducing kernel Hilbert spaces
was already considered e.g. by [9]. Given a kernel
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ we consider predictors
$f \in H$, where H denotes the corresponding reproducing kernel Hilbert space
of functions from $\mathcal{X}$ to $\mathbb{R}$. The space H includes, for example,
all functions of the form $f(x) = \sum_{j=1}^{m} \alpha_j k(x,x_j)$ where $x_j$ are
arbitrary elements in $\mathcal{X}$ and $\alpha_j \in \mathbb{R}$,
$1 \le j \le m$. To avoid overfitting, a support vector machine $f_{L,P,\lambda}$
is defined as the solution of a regularized risk minimization problem. More
precisely,

    $f_{L,P,\lambda} = \arg\inf_{f \in H} E_P L(X,Y,f(X)) + \lambda \|f\|_H^2$,    (1)

where $\lambda \in (0,\infty)$ is the regularization parameter. For a sample
$D = ((x_1,y_1), \ldots, (x_n,y_n))$ the corresponding estimated function is given
by

    $f_{L,D_n,\lambda} = \arg\inf_{f \in H} \frac{1}{n} \sum_{i=1}^{n} L(x_i,y_i,f(x_i)) + \lambda \|f\|_H^2$,    (2)

where $D_n$ denotes the empirical distribution based on D (see (3) below). Note
that the optimization problem (2) corresponds to (1) when using $D_n$ instead of P.

Efficient algorithms to compute $f_n := f_{L,D_n,\lambda}$ exist for a number of
different loss functions. However, there are often good reasons to consider other
convex loss functions, e.g. the hinge loss
$L(X,Y,f(X)) = \max\{1 - Y \cdot f(X), 0\}$ for binary classification purposes or
the $\epsilon$-insensitive loss $L(X,Y,f(X)) = \max\{0, |Y - f(X)| - \epsilon\}$
for regression purposes, where $\epsilon > 0$. As these loss functions are not
differentiable, the logistic loss functions
$L(X,Y,f(X)) = \ln(1 + \exp(-Y \cdot f(X)))$ and
$L(X,Y,f(X)) = -\ln\big(4 e^{Y - f(X)} / (1 + e^{Y - f(X)})^2\big)$ and Huber-type
loss functions are also used in practice. These loss functions can be considered
as smoothed versions of the previous two loss functions.
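The four loss functions named above can be written down directly. The following is a minimal sketch, assuming NumPy; the function names and the default value of `eps` are hypothetical choices made for illustration and do not come from the paper.

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss max{1 - y*t, 0}; y is a label in {-1, +1}, t = f(x)."""
    return np.maximum(1.0 - y * t, 0.0)

def eps_insensitive_loss(y, t, eps=0.1):
    """epsilon-insensitive loss max{0, |y - t| - eps}; eps=0.1 is an arbitrary default."""
    return np.maximum(0.0, np.abs(y - t) - eps)

def logistic_loss_classification(y, t):
    """ln(1 + exp(-y*t)), evaluated stably as logaddexp(0, -y*t)."""
    return np.logaddexp(0.0, -y * t)

def logistic_loss_regression(y, t):
    """-ln(4 e^r / (1 + e^r)^2) with r = y - t, expanded to avoid overflow."""
    r = y - t
    return -(np.log(4.0) + r - 2.0 * np.logaddexp(0.0, r))
```

Both logistic variants are differentiable in t everywhere, whereas the hinge and epsilon-insensitive losses have kinks at y*t = 1 and |y - t| = eps respectively.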

An important component of statistical analyses concerns quantifying and
incorporating uncertainty (e.g. sampling variability) in the reported estimates. For
example, one may want to include confidence bounds along the individual predicted
values $f_n(x_i)$ obtained from (2). Unfortunately, the sampling distribution of the
estimated function $f_n$ is unknown. Recently, [8] derived the asymptotic
distribution of SVMs under some mild conditions. Asymptotic confidence intervals
based on those general results are always symmetric.

Here, we are interested in approximating the finite sample distribution of SVMs
by Efron's bootstrap approach, because confidence intervals based on the bootstrap
approach can be asymmetric. The bootstrap [7] provides an alternative way to
estimate the sampling distribution of a wide variety of estimators. To fix ideas,
consider a functional $S : \mathcal{M} \to \mathcal{W}$, where $\mathcal{M}$ is a
set of probability measures and $\mathcal{W}$ denotes a metric space. Many
estimators can be included in this framework. Simple examples include the sample
mean (with functional $S(P) = \int Z \, dP$) and M-estimators (with functional
defined implicitly as the solution to the equation $E_P \psi(Z,S(P)) = 0$). Let
$\mathcal{B}(\mathcal{Z})$ be the Borel $\sigma$-algebra on
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and denote the set of all Borel
probability measures on $(\mathcal{Z}, \mathcal{B}(\mathcal{Z}))$ by
$\mathcal{M}_1(\mathcal{Z}, \mathcal{B}(\mathcal{Z}))$. Then, it follows that (1)
defines an operator

    $S : \mathcal{M}_1(\mathcal{Z}, \mathcal{B}(\mathcal{Z})) \to H$,  $S(P) = f_{L,P,\lambda}$,

i.e. the support vector machine. Moreover, the estimator in (2) satisfies

    $f_{L,D_n,\lambda} = S(D_n)$

where

    $D_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i,y_i)}$    (3)

is the empirical distribution based on the sample
$D = ((x_1,y_1), \ldots, (x_n,y_n))$ and $\delta_{(x_i,y_i)}$ denotes the Dirac
measure at the point $(x_i,y_i)$.

More generally, let $Z_i = (X_i,Y_i)$, $i = 1, \ldots, n$, be independent and
identically distributed (i.i.d.) random variables with distribution P, and let

    $S_n(Z_1, \ldots, Z_n) = S(\mathbb{P}_n)$

be the corresponding estimator, where

    $\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{Z_i}$.

Denote the distribution of $S(\mathbb{P}_n)$ by
$\mathcal{L}_n(S;P) = \mathcal{L}(S(\mathbb{P}_n))$. If P was known to us,
we could estimate this sampling distribution by drawing a large number of random
samples from P and evaluating our estimator on them. The basic idea of Efron's
bootstrap approach is to replace the unknown distribution P by an estimate
$\hat{P}$. Here we will consider the natural non-parametric estimator given by the
sample empirical distribution $\mathbb{P}_n$. In other words, we estimate the
distribution of our estimator of interest by its sampling distribution when the
data are generated by $\mathbb{P}_n$. In symbols, the bootstrap proposes to use

    $\hat{\mathcal{L}}_n(S;P) = \mathcal{L}_n(S;\mathbb{P}_n)$.

Since this distribution is generally unknown, in practice one uses Monte Carlo
simulation to estimate it by repeatedly evaluating the estimator on samples drawn
from $D_n$. Note that drawing a sample from $D_n$ means that n observations are
drawn with replacement from the original n observations
$(x_1,y_1), \ldots, (x_n,y_n)$.
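The Monte Carlo scheme just described can be sketched in a few lines. The example below is only an illustration under explicit assumptions: it uses the least squares loss with a Gaussian RBF kernel, because then problem (2) has the closed-form solution $\alpha = (K + n\lambda I)^{-1} y$ with $f = \sum_j \alpha_j k(\cdot, x_j)$; the helper names, the simulated data, and the values of `lam`, `gamma`, and `n_boot` are hypothetical and not taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_ls_svm(x, y, lam, gamma=1.0):
    """Solve (2) for the least squares loss: alpha = (K + n*lam*I)^{-1} y."""
    n = len(y)
    K = rbf_kernel(x, x, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(x_train, alpha, x_new, gamma=1.0):
    """Evaluate f(x_new) = sum_j alpha_j k(x_new, x_j)."""
    return rbf_kernel(x_new, x_train, gamma) @ alpha

def bootstrap_predictions(x, y, x_new, lam, n_boot=200, gamma=1.0, seed=0):
    """Refit the estimator on n_boot samples drawn with replacement from the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    out = np.empty((n_boot, len(x_new)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # n draws with replacement from D_n
        alpha_b = fit_ls_svm(x[idx], y[idx], lam, gamma)
        out[b] = predict(x[idx], alpha_b, x_new, gamma)
    return out

# Simulated data (an assumption made purely for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=(60, 1))
y = np.sin(2.0 * x[:, 0]) + 0.3 * rng.standard_normal(60)
x_new = np.array([[0.0], [1.0]])

boot = bootstrap_predictions(x, y, x_new, lam=0.01)
# Percentile intervals for f_n(x_new); these need not be symmetric.
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
```

Percentile intervals are only one of several ways to turn the bootstrap replicates into an interval; they are used here because they make the possible asymmetry visible without further assumptions.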



3 Consistency of Bootstrap SVMs 



In this section it will be shown under appropriate assumptions that the weak
consistency of bootstrap estimators carries over to the Hadamard-differentiable SVM
functional in the sense that the sequence of "conditional random laws" (given
$(X_1,Y_1), (X_2,Y_2), \ldots$) of
$\sqrt{n}(f_{L,\hat{\mathbb{P}}_n,\lambda} - f_{L,\mathbb{P}_n,\lambda})$ is
asymptotically consistent in probability for estimating the laws of the random
elements $\sqrt{n}(f_{L,\mathbb{P}_n,\lambda} - f_{L,P,\lambda})$. In other
words, if n is large, the "random distribution"

    $\mathcal{L}\big(\sqrt{n}(f_{L,\hat{\mathbb{P}}_n,\lambda} - f_{L,\mathbb{P}_n,\lambda})\big)$    (4)

based on bootstrapping an SVM can be considered as a valid approximation of the
unknown finite sample distribution

    $\mathcal{L}\big(\sqrt{n}(f_{L,\mathbb{P}_n,\lambda} - f_{L,P,\lambda})\big)$.    (5)

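Assumption 1 Let $\mathcal{X} \subset \mathbb{R}^d$ be closed and bounded and let
$\mathcal{Y} \subset \mathbb{R}$ be closed. Assume that
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the restriction of an
m-times continuously differentiable kernel
$\tilde{k} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ such that
$m > d/2$ and $k \ne 0$. Let H be the RKHS of k and let P be a probability
distribution on
$(\mathcal{X} \times \mathcal{Y}, \mathcal{B}(\mathcal{X} \times \mathcal{Y}))$. Let
$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$ be a convex,
P-square-integrable Nemitski loss function of order $p \in [1,\infty)$ such that
the partial derivatives

    $L'(x,y,t) := \frac{\partial L}{\partial t}(x,y,t)$  and  $L''(x,y,t) := \frac{\partial^2 L}{\partial t^2}(x,y,t)$

exist for every $(x,y,t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$.
Assume that the maps

    $(x,y,t) \mapsto L'(x,y,t)$  and  $(x,y,t) \mapsto L''(x,y,t)$

are continuous. Furthermore, assume that for every $a \in (0,\infty)$, there is a
$b'_a \in L_2(P)$ and a constant $b''_a \in [0,\infty)$ such that, for every
$(x,y) \in \mathcal{X} \times \mathcal{Y}$,

    $\sup_{t \in [-a,a]} |L'(x,y,t)| \le b'_a(x,y)$  and  $\sup_{t \in [-a,a]} |L''(x,y,t)| \le b''_a$.    (6)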

The conditions on the kernel k in Assumption 1 are satisfied for many common
kernels, e.g., Gaussian RBF kernel, exponential kernel, polynomial kernel, and
linear kernel, but also Wendland kernels $k_{d,\ell}$ based on certain univariate
polynomials $p_{d,\ell}$ of degree $\lfloor d/2 \rfloor + 3\ell + 1$ for
$\ell \in \mathbb{N}$ such that $\ell > d/4$, see [17].

The conditions on the loss function L in Assumption 1 are satisfied, e.g., for the
logistic loss for classification or for regression. The popular non-smooth
loss functions hinge, $\epsilon$-insensitive, and pinball are not covered. However,
[8, Remark 3.5] described an analytical method to approximate such non-smooth loss
functions up to an arbitrarily good precision $\epsilon > 0$ by a convex
P-square-integrable Nemitski loss function of order $p \in [1,\infty)$.
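As a sanity check of condition (6) for one concrete case, the derivatives of the logistic loss for classification can be evaluated numerically. This is a sketch under stated assumptions: labels are taken to be in {-1, +1}, the grid bound `a = 5.0` is an arbitrary choice, and the candidate bounds $b'_a \equiv 1$ and $b''_a = 1/4$ follow from elementary calculus rather than from the paper.

```python
import numpy as np

def logistic_loss(y, t):
    """L(y, t) = ln(1 + exp(-y*t))."""
    return np.logaddexp(0.0, -y * t)

def logistic_d1(y, t):
    """First partial derivative in t: -y / (1 + exp(y*t))."""
    return -y / (1.0 + np.exp(y * t))

def logistic_d2(y, t):
    """Second partial derivative in t: y^2 * s * (1 - s), s = 1 / (1 + exp(-y*t))."""
    s = 1.0 / (1.0 + np.exp(-y * t))
    return y ** 2 * s * (1.0 - s)

a = 5.0                                  # arbitrary bound for the grid [-a, a]
t_grid = np.linspace(-a, a, 2001)
sup_d1 = max(np.abs(logistic_d1(y, t_grid)).max() for y in (-1.0, 1.0))
sup_d2 = max(np.abs(logistic_d2(y, t_grid)).max() for y in (-1.0, 1.0))

# Central finite differences as an independent check of the first derivative.
h = 1e-6
fd = (logistic_loss(1.0, t_grid + h) - logistic_loss(1.0, t_grid - h)) / (2.0 * h)
max_fd_error = np.abs(fd - logistic_d1(1.0, t_grid)).max()
```

On this grid `sup_d1` stays below 1 and `sup_d2` does not exceed 1/4, consistent with constant bounds in (6) that do not even depend on a.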

We can now state our result on the consistency of the bootstrap approach for 
SVMs. 
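Theorem 2. Let Assumption 1 be satisfied. Let $\lambda \in (0,\infty)$. Then

    $\sup_{h \in \mathrm{BL}_1(H)} \big| E_M h\big(\sqrt{n}(f_{L,\hat{\mathbb{P}}_n,\lambda} - f_{L,\mathbb{P}_n,\lambda})\big) - E h(S'_P(\mathbb{G})) \big| \to 0$,    (7)

    $E_M h\big(\sqrt{n}(f_{L,\hat{\mathbb{P}}_n,\lambda} - f_{L,\mathbb{P}_n,\lambda})\big)^* - E_M h\big(\sqrt{n}(f_{L,\hat{\mathbb{P}}_n,\lambda} - f_{L,\mathbb{P}_n,\lambda})\big)_* \to 0$    (8)

converge in outer probability, where $\mathbb{G}$ is a tight Borel-measurable
Gaussian process, $S'_P$ is a continuous linear operator with

    $S'_P(Q) = -K_P^{-1}\big(E_Q(L'(X,Y,f_{L,P,\lambda}(X)) \Phi(X))\big)$,  $Q \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$,    (9)

and

    $K_P : H \to H$,  $f \mapsto 2\lambda f + E_P\big(L''(X,Y,f_{L,P,\lambda}(X)) f(X) \Phi(X)\big)$    (10)

is a continuous linear operator which is invertible.

For details on $K_P$, $S'_P$, and $\mathbb{G}$ we refer to Lemma 1, Theorem 6, and
Lemma 2.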



4 Proofs 

4.1 Tools for the proof of Theorem 2 

We will need two general results on bootstrap methods proven in [13] and adopt
their notation, see [13, Chapters 3.6 and 3.9]. Let $\mathbb{P}_n$ be the empirical
measure of an i.i.d. sample $Z_1, \ldots, Z_n$ from a probability distribution P.
The empirical process is the signed measure

    $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$.

Given the sample values, let $\hat{Z}_1, \ldots, \hat{Z}_n$ be an i.i.d. sample from
$\mathbb{P}_n$. The bootstrap empirical distribution is the empirical measure
$\hat{\mathbb{P}}_n := n^{-1} \sum_{i=1}^{n} \delta_{\hat{Z}_i}$, and the bootstrap
empirical process is

    $\hat{\mathbb{G}}_n = \sqrt{n}(\hat{\mathbb{P}}_n - \mathbb{P}_n) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (M_{ni} - 1) \delta_{Z_i}$,

where $M_{ni}$ is the number of times that $Z_i$ is "redrawn" from the original
sample $Z_1, \ldots, Z_n$, $M := (M_{n1}, \ldots, M_{nn})$ is stochastically
independent of $Z_1, \ldots, Z_n$ and multinomially distributed with parameters n
and probabilities $\frac{1}{n}, \ldots, \frac{1}{n}$. If outer expectations are
computed, stochastic independence is understood in terms of a product
probability space. Let $Z_1, Z_2, \ldots$ be the coordinate projections on the first
$\infty$ coordinates of the product space
$(\mathcal{Z}^{\infty}, \mathcal{B}^{\infty}, P^{\infty}) \times (\mathcal{Z}', \mathcal{C}, Q)$
and let the multinomial vectors M depend on the last factor only, see [13, p. 345f].
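The multinomial representation of the bootstrap empirical process can be checked numerically for a single test function. The sketch below assumes NumPy; the sample `z`, the test function `g = tanh`, and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
z = rng.standard_normal(n)      # original sample Z_1, ..., Z_n (simulated)
g = np.tanh                     # an arbitrary bounded test function

# Draw a bootstrap sample by resampling indices with replacement.
idx = rng.integers(0, n, size=n)

# M[i] counts how often Z_i was redrawn; M is multinomial(n; 1/n, ..., 1/n).
M = np.bincount(idx, minlength=n)

# sqrt(n) * (bootstrap empirical measure - empirical measure) applied to g ...
lhs = np.sqrt(n) * (g(z[idx]).mean() - g(z).mean())
# ... equals n^{-1/2} * sum_i (M_ni - 1) * g(Z_i).
rhs = ((M - 1) * g(z)).sum() / np.sqrt(n)
```

Both expressions agree up to floating point error, which is just the identity $\hat{\mathbb{P}}_n g = n^{-1} \sum_i M_{ni}\, g(Z_i)$ written out.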

The following theorem shows (conditional) weak convergence for the empirical
bootstrap, where the symbol $\rightsquigarrow$ denotes the weak convergence of
finite measures. We will need only the equivalence between (i) and (iii) from this
theorem and list part (ii) only for the sake of completeness.

Theorem 3 ([13, Thm. 3.6.2, p. 347]). Let $\mathcal{F}$ be a class of measurable
functions with finite envelope function. Define
$\hat{\mathbb{Y}}_n := n^{-1/2} \sum_{i=1}^{n} (M_{N_n,i} - 1)(\delta_{Z_i} - P)$.
The following statements are equivalent:

(i) $\mathcal{F}$ is Donsker and $P^* \|f - Pf\|_{\mathcal{F}}^2 < \infty$;

(ii) $\sup_{h \in \mathrm{BL}_1} |E_{M,N} h(\hat{\mathbb{Y}}_n) - E h(\mathbb{G})|$
converges outer almost surely to zero and the sequence
$E_{M,N} h(\hat{\mathbb{Y}}_n)^* - E_{M,N} h(\hat{\mathbb{Y}}_n)_*$ converges almost
surely to zero for every $h \in \mathrm{BL}_1$.

(iii) $\sup_{h \in \mathrm{BL}_1} |E_M h(\hat{\mathbb{G}}_n) - E h(\mathbb{G})|$
converges outer almost surely to zero and the sequence
$E_M h(\hat{\mathbb{G}}_n)^* - E_M h(\hat{\mathbb{G}}_n)_*$ converges almost surely
to zero for every $h \in \mathrm{BL}_1$.

Here the asterisks denote the measurable cover functions with respect to M, N, and
$Z_1, Z_2, \ldots$ jointly.

Consider sequences of random elements $\mathbb{P}_n = \mathbb{P}_n(Z_n)$ and
$\hat{\mathbb{P}}_n = \hat{\mathbb{P}}_n(Z_n, M_n)$ in a normed space $\mathbb{D}$
such that the sequence $\sqrt{n}(\mathbb{P}_n - P)$ converges unconditionally and
the sequence $\sqrt{n}(\hat{\mathbb{P}}_n - \mathbb{P}_n)$ converges conditionally
on $Z_n$ in distribution to a tight random element $\mathbb{G}$. A precise
formulation of the second assumption is

    $\sup_{h \in \mathrm{BL}_1(\mathbb{D})} \big| E_M h\big(\sqrt{n}(\hat{\mathbb{P}}_n - \mathbb{P}_n)\big) - E h(\mathbb{G}) \big| \to 0$,    (11)

    $E_M h\big(\sqrt{n}(\hat{\mathbb{P}}_n - \mathbb{P}_n)\big)^* - E_M h\big(\sqrt{n}(\hat{\mathbb{P}}_n - \mathbb{P}_n)\big)_* \to 0$,    (12)

in outer probability, with h ranging over the bounded Lipschitz functions, see [13,
p. 378, Formula (3.9.9)]. The next theorem shows that under appropriate
assumptions, weak consistency of the bootstrap estimators carries over to any
Hadamard-differentiable functional in the sense that the sequence of "conditional
random laws" (given $Z_1, Z_2, \ldots$) of
$\sqrt{n}(\phi(\hat{\mathbb{P}}_n) - \phi(\mathbb{P}_n))$ is asymptotically
consistent in probability for estimating the laws of the random elements
$\sqrt{n}(\phi(\mathbb{P}_n) - \phi(P))$, see [13, p. 378].

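Theorem 4 ([13, Thm. 3.9.11, p. 378]). (Delta-method for bootstrap in probability)
Let $\mathbb{D}$ and $\mathbb{E}$ be normed spaces. Let
$\phi : \mathbb{D}_\phi \subset \mathbb{D} \to \mathbb{E}$ be
Hadamard-differentiable at P tangentially to a subspace $\mathbb{D}_0$. Let
$\mathbb{P}_n$ and $\hat{\mathbb{P}}_n$ be maps as indicated previously with values
in $\mathbb{D}_\phi$ such that
$\mathbb{G}_n := \sqrt{n}(\mathbb{P}_n - P) \rightsquigarrow \mathbb{G}$ and that
(11)-(12) holds in outer probability, where $\mathbb{G}$ is separable and takes its
values in $\mathbb{D}_0$. Then

    $\sup_{h \in \mathrm{BL}_1(\mathbb{E})} \big| E_M h\big(\sqrt{n}(\phi(\hat{\mathbb{P}}_n) - \phi(\mathbb{P}_n))\big) - E h(\phi'_P(\mathbb{G})) \big| \to 0$,    (13)

    $E_M h\big(\sqrt{n}(\phi(\hat{\mathbb{P}}_n) - \phi(\mathbb{P}_n))\big)^* - E_M h\big(\sqrt{n}(\phi(\hat{\mathbb{P}}_n) - \phi(\mathbb{P}_n))\big)_* \to 0$,    (14)

holds in outer probability.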
As was pointed out by [13, p. 378], consistency in probability appears to be sufficient
for (many) statistical purposes and the theorem above shows this is retained under
Hadamard differentiability at the single distribution P.
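To see what the bootstrap delta method asserts in the simplest possible setting, one can take a toy functional instead of the SVM operator. The sketch below is an illustration under assumptions that are not from the paper: the functional $\phi(P) = (\int z \, dP)^2$, normally distributed simulated data, and the sample and replication sizes are all arbitrary choices. For this $\phi$ the classical delta method gives the limiting standard deviation $2|\mu|\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 2000, 2000
z = rng.normal(loc=1.0, scale=1.0, size=n)   # simulated i.i.d. sample

def phi(mean_value):
    """Toy functional phi(P) = (mean of P)^2, expressed through the mean."""
    return mean_value ** 2

# Bootstrap replicates of sqrt(n) * (phi(P_n^boot) - phi(P_n)).
idx = rng.integers(0, n, size=(n_boot, n))   # each row is one resample
boot_means = z[idx].mean(axis=1)
boot_stat = np.sqrt(n) * (phi(boot_means) - phi(z.mean()))

# Spread of the conditional bootstrap law versus the plug-in delta-method value.
boot_sd = boot_stat.std(ddof=1)
delta_sd = 2.0 * abs(z.mean()) * z.std(ddof=1)
```

For moderately large n the two standard deviations should be close, which is the finite-dimensional analogue of what (13) states for Hadamard-differentiable maps between normed spaces.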

We now list some results from [8], which will also be essential for the proof of 
Theorem 2. 

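Theorem 5 ([8, Theorem 3.1]). Let Assumption 1 be satisfied. Then, for every
regularizing parameter $\lambda_0 \in (0,\infty)$, there is a tight,
Borel-measurable Gaussian process
$\mathbb{H} : \Omega \to H$, $\omega \mapsto \mathbb{H}(\omega)$, such that

    $\sqrt{n}\big(f_{L,D_n,\lambda_{D_n}} - f_{L,P,\lambda_0}\big) \rightsquigarrow \mathbb{H}$  in H    (15)

for every Borel-measurable sequence of random regularization parameters
$\lambda_{D_n}$ with $\sqrt{n}(\lambda_{D_n} - \lambda_0) \to 0$ in probability. The
Gaussian process $\mathbb{H}$ is zero-mean; i.e.,
$E\langle f, \mathbb{H} \rangle_H = 0$ for every $f \in H$.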

Lemma 1 ([8, Lemma A.5]). For every $F \in B_S$ defined later in (25),

    $K_F : H \to H$,  $f \mapsto 2\lambda_0 f + \int L''(x,y,f_{L,\iota(F),\lambda_0}(x)) f(x) \Phi(x) \, d\iota(F)(x,y)$    (16)

is a continuous linear operator which is invertible.

Theorem 6 ([8, Theorem A.8]). For every $F_0 \in B_S$ which fulfills
$F_0(b) < E_P(b) + \lambda_0$, the map $S : B_S \to H$, $F \mapsto f_{\iota(F)}$, is
Hadamard-differentiable in $F_0$ tangentially to the closed linear span
$B_0 = \mathrm{cl}(\mathrm{lin}(B_S))$. The derivative in $F_0$ is a continuous
linear operator $S'_{F_0} : B_0 \to H$ such that

    $S'_{F_0}(G) = -K_{F_0}^{-1}\big(E_{\iota(G)}(L'(X,Y,f_{L,\iota(F_0),\lambda_0}(X)) \Phi(X))\big)$,  $\forall G \in \mathrm{lin}(B_S)$.    (17)

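Lemma 2 ([8, Lemma A.9]). For every data set
$D_n = ((x_1,y_1), \ldots, (x_n,y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$, let
$F_{D_n}$ denote the element of $\ell_\infty(\mathcal{G})$ which corresponds to the
empirical measure $\mathbb{P}_n := P_{D_n}$. That is,
$F_{D_n}(g) = \int g \, d\mathbb{P}_n = n^{-1} \sum_{i=1}^{n} g(x_i,y_i)$ for every
$g \in \mathcal{G}$. Then

    $\sqrt{n}\big(F_{D_n} - \iota^{-1}(P)\big) \rightsquigarrow \mathbb{G}$  in $\ell_\infty(\mathcal{G})$,    (18)

where $\mathbb{G} : \Omega \to \ell_\infty(\mathcal{G})$ is a tight Borel-measurable
Gaussian process such that $\mathbb{G}(\omega) \in B_0$ for every
$\omega \in \Omega$.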



4.2 Proof of Theorem 2 

The proof relies on the application of Theorem 4. Hence, we have to show the
following steps:

1. The empirical process $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$ weakly
converges to a separable Gaussian process $\mathbb{G}$.

2. SVMs are based on a map which is Hadamard differentiable at P tangentially
to some appropriate subspace.

3. The assumptions (11)-(12) of Theorem 4 are satisfied. For this purpose we will
use Theorem 3. Actually, we will show that part (i) of Theorem 3 is satisfied
which gives the equivalence to part (iii), from which we conclude that (11)-(12)
hold true. For the proof that part (i) of Theorem 3 is satisfied, i.e., that a
suitable set $\mathcal{F}$ is a P-Donsker class and that
$P^* \|f - Pf\|_{\mathcal{F}}^2 < \infty$, we use several facts recently shown
by [8].

4. We put all parts together and apply Theorem 4.

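Step 1. To apply Theorem 4, we first have to specify the considered spaces
$\mathbb{D}$, $\mathbb{E}$, $\mathbb{D}_\phi$, $\mathbb{D}_0$ and the map $\phi$. As
in [8] we use the following notations. Because L is a P-square-integrable Nemitski
loss function of order $p \in [1,\infty)$, there is a function $b \in L_2(P)$ such
that

    $|L(x,y,t)| \le b(x,y) + |t|^p$,  $(x,y,t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$.    (19)

Let

    $c_0 := \sqrt{\lambda_0^{-1} E_P(b)} + 1$.    (20)

Define

    $\mathcal{G} := \mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3$,    (21)

where

    $\mathcal{G}_1 := \{ g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R} : \exists z \in \mathbb{R}^{d+1}$ such that $g = I_{(-\infty,z]} \}$    (22)

is the set of all indicator functions $I_{(-\infty,z]}$,

    $\mathcal{G}_2 := \{ g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R} : \exists f_0 \in H, \exists f \in H$ such that $\|f_0\|_H \le c_0$, $\|f\|_H \le 1$, $g(x,y) = L'(x,y,f_0(x)) f(x) \ \forall (x,y) \}$    (23)

and

    $\mathcal{G}_3 := \{b\}$.    (24)

Now let $\ell_\infty(\mathcal{G})$ be the set of all bounded functions
$F : \mathcal{G} \to \mathbb{R}$ with norm
$\|F\|_\infty = \sup_{g \in \mathcal{G}} |F(g)|$. Define

    $B_S := \{ F : \mathcal{G} \to \mathbb{R} : \exists \mu \ne 0$ a finite measure on $\mathcal{X} \times \mathcal{Y}$ such that $F(g) = \int g \, d\mu \ \forall g \in \mathcal{G}$, $b \in L_2(\mu)$, $b'_a \in L_2(\mu) \ \forall a \in (0,\infty) \}$    (25)

and

    $B_0 := \mathrm{cl}(\mathrm{lin}(B_S))$    (26)

the closed linear span of $B_S$ in $\ell_\infty(\mathcal{G})$. That is, $B_S$ is a
subset of $\ell_\infty(\mathcal{G})$ whose elements correspond to finite measures.
Hence probability measures are covered as special cases. The elements of $B_S$ can
be interpreted as some kind of generalized distribution functions, because
$\mathcal{G}_1 \subset \mathcal{G}$. The assumptions on L and P imply that
$\mathcal{G} \to \mathbb{R}$, $g \mapsto \int g \, dP$ is a well-defined element of
$B_S$. For every $F \in B_S$, let $\iota(F)$ denote the corresponding finite measure
on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}(\mathcal{X} \times \mathcal{Y}))$
such that $F(g) = \int g \, d\iota(F)$ for all $g \in \mathcal{G}$. Note that the
map $\iota$ is well-defined, because by definition of $B_S$, $\iota(F)$ uniquely
exists for every $F \in B_S$.

With these notations, we will apply Theorem 4 for

    $\mathbb{D} := \ell_\infty(\mathcal{G})$,  $\mathbb{E} := H$ (= RKHS of the kernel k),
    $\mathbb{D}_\phi := B_S$,  $\mathbb{D}_0 := B_0 := \mathrm{cl}(\mathrm{lin}(B_S))$,
    $\lambda_0 \in (0,\infty)$,    (27)
    $\phi := S$,  $S : B_S \to H$,  $F \mapsto f_{\iota(F)} := f_{L,\iota(F),\lambda_0} := \arg\inf_{f \in H} \int L(x,y,f(x)) \, d\iota(F)(x,y) + \lambda_0 \|f\|_H^2$.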

At first glance this definition of S seems to be somewhat technical. However, this
will allow us to use a functional delta method for bootstrap estimators of SVMs with
regularization parameter $\lambda = \lambda_0 \in (0,\infty)$.

Lemma 2 guarantees that the empirical process
$\mathbb{G}_n := \sqrt{n}(\mathbb{P}_n - P)$ weakly converges to a tight
Borel-measurable Gaussian process.

Since a $\sigma$-compact set in a metric space is separable, separability of a
random variable is slightly weaker than tightness, see [13, p. 17]. Therefore,
$\mathbb{G}$ in our Theorem 2 is indeed separable.

Step 2. Theorem 6 showed that the map S indeed satisfies the necessary
Hadamard-differentiability in the point $\mathbb{P} := \iota^{-1}(P)$.

Step 3. We know that $\mathcal{G}$ is a P-Donsker class, see Lemma 2. Hence, an
immediate consequence from [13, Theorem 3.6.1, p. 347] is that

    $\sup_{h \in \mathrm{BL}_1} |E_M h(\hat{\mathbb{G}}_n) - E h(\mathbb{G})|$    (28)

converges in outer probability to zero and $\hat{\mathbb{G}}_n$ is asymptotically
measurable.

However, we will prove a somewhat stronger result, namely that $\mathcal{G}$ is a
P-Donsker class and $P^* \|g - Pg\|_{\mathcal{G}}^2 < \infty$, which is part (i) of
Theorem 3, and then part (iii) of Theorem 3 yields that the term in (28) converges
even outer almost surely to zero and the sequence

    $E_M h(\hat{\mathbb{G}}_n)^* - E_M h(\hat{\mathbb{G}}_n)_*$    (29)

converges almost surely to zero for every $h \in \mathrm{BL}_1$.

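Because $\mathcal{G}$ is a P-Donsker class, it remains to show that
$P^* \|g - Pg\|_{\mathcal{G}}^2 < \infty$. Due to

    $P^* \|g - Pg\|_{\mathcal{G}}^2 := \int \big( \sup_{g \in \mathcal{G}} |g - E_P(g)| \big)^2 \, dP^*$    (30)

and $\mathcal{G} = \mathcal{G}_1 \cup \mathcal{G}_2 \cup \mathcal{G}_3$, we obtain
the inequality

    $P^* \|g - Pg\|_{\mathcal{G}}^2 \le P^* \sup_{g \in \mathcal{G}} \big( g^2 + 2|g| \cdot P|g| + (P|g|)^2 \big)$
        $\le P^* \sup_{g \in \mathcal{G}} g^2 + 2 P^* \sup_{g \in \mathcal{G}} (|g| \cdot P|g|) + \sup_{g \in \mathcal{G}} (P|g|)^2$
        $\le \sum_{j=1}^{3} \Big( P^* \sup_{g \in \mathcal{G}_j} g^2 + 2 P^* \sup_{g \in \mathcal{G}_j} (|g| \cdot P|g|) + \sup_{g \in \mathcal{G}_j} (P|g|)^2 \Big)$.    (31)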



We will show that each of the three summands on the right hand side of the last
inequality is finite. If $g \in \mathcal{G}_1$, then g equals the indicator function
$I_{(-\infty,z]}$ for some $z \in \mathbb{R}^{d+1}$. Hence, $\|g\|_\infty = 1$ and
the summand for j = 1 is finite. If $g \in \mathcal{G}_3$, then
$g = b \in L_2(P)$ because L is by assumption a P-square-integrable Nemitski loss
function of order $p \in [1,\infty)$. Hence the summand for j = 3 is finite, too.
Let us now consider the case that $g \in \mathcal{G}_2$. By definition of
$\mathcal{G}_2$, for every $g \in \mathcal{G}_2$ there exist $f_0, f \in H$ such
that $\|f_0\|_H \le c_0$, $\|f\|_H \le 1$, and $g = L'_{f_0} f$, where we used the
notation $(L'_{f_0} f)(x,y) := L'(x,y,f_0(x)) f(x)$ for all
$(x,y) \in \mathcal{X} \times \mathcal{Y}$. Using
$\|f\|_\infty \le \|k\|_\infty \|f\|_H$ for every $f \in H$, we obtain

    $\|f_0\|_H \le c_0 \Rightarrow \|f_0\|_\infty \le c_0 \|k\|_\infty$  and  $\|f\|_H \le 1 \Rightarrow \|f\|_\infty \le \|k\|_\infty$.    (32)

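Define the constant $a := c_0 \|k\|_\infty$ with $c_0$ given by (20). Hence, for all
$(x,y) \in \mathcal{X} \times \mathcal{Y}$,

    $\sup_{f_0 \in H; \|f_0\|_H \le c_0} |L'(x,y,f_0(x))|^2 \le \sup_{f_0 \in H; \|f_0\|_\infty \le a} \ \sup_{t \in [-a,+a]} |L'(x,y,t)|^2$
        $\overset{(6)}{\le} \sup_{f_0 \in H; \|f_0\|_\infty \le a} (b'_a(x,y))^2$.    (33)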


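Hence we get

    $P^* \sup_{g \in \mathcal{G}_2} g^2 = \int \sup_{f_0, f \in H; \|f_0\|_H \le c_0, \|f\|_H \le 1} |L'(x,y,f_0(x)) f(x)|^2 \, dP^*(x,y)$
        $\le \int \sup_{f_0 \in H; \|f_0\|_H \le c_0} |L'(x,y,f_0(x))|^2 \ \sup_{f \in H; \|f\|_H \le 1} |f(x)|^2 \, dP^*(x,y)$
        $\overset{(33),(32)}{\le} \|k\|_\infty^2 \int (b'_a)^2 \, dP^* = \|k\|_\infty^2 \int (b'_a)^2 \, dP < \infty$,

because $b'_a \in L_2(P)$ and $\|k\|_\infty < \infty$ by Assumption 1. With the same
arguments we obtain, for every $g \in \mathcal{G}_2$,

    $P|g| \le \int \sup_{g \in \mathcal{G}_2} |g| \, dP^*$
        $\le \int \sup_{f_0 \in H; \|f_0\|_H \le c_0} |L'(x,y,f_0(x))| \ \sup_{f \in H; \|f\|_H \le 1} |f(x)| \, dP^*(x,y)$
        $\overset{(33),(32)}{\le} \int b'_a(x,y) \|k\|_\infty \, dP^*(x,y)$
        $\le \|k\|_\infty \int b'_a \, dP < \infty$,

because $b'_a \in L_2(P)$ and $\|k\|_\infty < \infty$ by Assumption 1. Hence,

    $P^* \sup_{g \in \mathcal{G}_2} (|g| \cdot P|g|) \le \|k\|_\infty \int b'_a \, dP \int \sup_{g \in \mathcal{G}_2} |g| \, dP^* \le \|k\|_\infty^2 \Big( \int b'_a \, dP \Big)^2 < \infty$.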



Therefore, the sum on the right hand side in (31) is finite and thus the
assumption $P^* \|g - Pg\|_{\mathcal{G}}^2 < \infty$ is satisfied. This yields by
part (iii) of Theorem 3 that
$\sup_{h \in \mathrm{BL}_1} |E_M h(\hat{\mathbb{G}}_n) - E h(\mathbb{G})|$ converges
outer almost surely to zero and the sequence

    $E_M h(\hat{\mathbb{G}}_n)^* - E_M h(\hat{\mathbb{G}}_n)_*$    (34)

converges almost surely to zero for every $h \in \mathrm{BL}_1$, where the asterisks
denote the measurable cover functions with respect to M and $Z_1, Z_2, \ldots$
jointly.

Step 4. Due to Step 3, the assumption (11) of Theorem 4 is satisfied. We now
show that additionally (12) is satisfied, i.e., that the term in (34) converges to
zero in outer probability. In general, one cannot conclude that almost sure
convergence implies convergence in outer probability, see [13, p. 52]. We know that
the term in (34) converges almost surely to zero for every $h \in \mathrm{BL}_1$,
where the asterisks denote the measurable cover functions with respect to M and
$(X_1,Y_1), (X_2,Y_2), \ldots$ jointly. Hence, for every $h \in \mathrm{BL}_1$, the
cover functions to be considered in (34) are measurable. Additionally, the
multinomially distributed random variable M is stochastically independent of
$(X_1,Y_1), \ldots, (X_n,Y_n)$ in the bootstrap, where independence is understood
in terms of a product probability space, see [13, p. 346] for details. Therefore,
an application of the Fubini-Tonelli theorem, see e.g., [4, p. 174, Thm. 2.4.10],
yields that the inner integral
$E_M h\big(\sqrt{n}(\hat{\mathbb{P}}_n - \mathbb{P}_n)\big)^* - E_M h\big(\sqrt{n}(\hat{\mathbb{P}}_n - \mathbb{P}_n)\big)_*$
considered by Fubini-Tonelli is measurable for every $n \in \mathbb{N}$ and every
$h \in \mathrm{BL}_1$. Recall that almost sure convergence of measurable functions
implies convergence in probability, which is equivalent to convergence in outer
probability for measurable functions. Hence we have convergence in outer
probability in (34). Therefore, all assumptions of Theorem 4 are satisfied and the
assertion of our theorem follows. ■